    Stochastic Online Linear Regression: the Forward Algorithm to Replace Ridge

    Get PDF
    We consider the problem of online linear regression in the stochastic setting. We derive high probability regret bounds for online ridge regression and the forward algorithm. This enables us to compare online regression algorithms more accurately and eliminate assumptions of bounded observations and predictions. Our study advocates for the use of the forward algorithm in lieu of ridge due to its enhanced bounds and robustness to the regularization parameter. Moreover, we explain how to integrate it in algorithms involving linear function approximation to remove a boundedness assumption without deteriorating theoretical bounds. We showcase this modification in linear bandit settings where it yields improved regret bounds. Last, we provide numerical experiments to illustrate our results and endorse our intuitions.
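
    As a minimal sketch of the difference discussed above (illustrative code on synthetic Gaussian-noise linear data, not the paper's implementation or analysis), the forward algorithm predicts with the current feature already included in the regularized Gram matrix, whereas online ridge predicts using past features only:

        # Minimal sketch (not the paper's code): online ridge vs. the forward
        # algorithm on synthetic stochastic linear data. The only difference
        # between the two predictors is whether the current feature x_t is
        # added to the regularized Gram matrix before predicting.
        import numpy as np

        def run(predictor, X, y, lam=1.0):
            d = X.shape[1]
            A = lam * np.eye(d)   # regularized Gram matrix
            b = np.zeros(d)       # running sum of y_s * x_s
            preds = []
            for x_t, y_t in zip(X, y):
                G = A + np.outer(x_t, x_t) if predictor == "forward" else A
                preds.append(x_t @ np.linalg.solve(G, b))
                A += np.outer(x_t, x_t)
                b += y_t * x_t
            return np.array(preds)

        rng = np.random.default_rng(0)
        d, T = 5, 1000
        theta = rng.normal(size=d)
        X = rng.normal(size=(T, d))
        y = X @ theta + rng.normal(scale=0.5, size=T)
        for name in ("ridge", "forward"):
            p = run(name, X, y)
            print(name, "cumulative squared loss:", float(np.sum((y - p) ** 2)))

    In such a sketch, the extra rank-one term in the forward predictor is what makes it less sensitive to the choice of lam, which is consistent with the robustness to the regularization parameter mentioned above.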

    Learning Value Functions in Deep Policy Gradients using Residual Variance

    Get PDF
    Policy gradient algorithms have proven to be successful in diverse decision making and control tasks. However, these methods suffer from high sample complexity and instability issues. In this paper, we address these challenges by providing a different approach for training the critic in the actor-critic framework. Our work builds on recent studies indicating that traditional actor-critic algorithms do not succeed in fitting the true value function, calling for the need to identify a better objective for the critic. In our method, the critic uses a new state-value (resp. state-action-value) function approximation that learns the value of the states (resp. state-action pairs) relative to their mean value rather than the absolute value as in conventional actor-critic. We prove the theoretical consistency of the new gradient estimator and observe dramatic empirical improvement across a variety of continuous control tasks and algorithms. Furthermore, we validate our method in tasks with sparse rewards, where we provide experimental evidence and theoretical insights.
    Comment: Accepted at ICLR 202
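
    As a hedged illustration of the critic objective described above (names and shapes are illustrative, not the authors' implementation), one can replace the usual mean-squared-error critic loss with the variance of the residuals between value targets and value predictions, so the critic fits values relative to their mean rather than their absolute level:

        # Illustrative sketch: MSE critic loss vs. a residual-variance critic loss.
        # The residual-variance loss is invariant to adding a constant to the
        # value predictions, i.e. it fits values up to their mean.
        import torch

        def mse_critic_loss(v_pred, v_target):
            return (v_target - v_pred).pow(2).mean()

        def residual_variance_critic_loss(v_pred, v_target):
            residuals = v_target - v_pred
            return (residuals - residuals.mean()).pow(2).mean()

        # A constant offset in the predictions changes the MSE loss but leaves
        # the residual-variance loss unchanged.
        v_target = torch.randn(256)
        v_pred = torch.randn(256)
        print(mse_critic_loss(v_pred, v_target), mse_critic_loss(v_pred + 10.0, v_target))
        print(residual_variance_critic_loss(v_pred, v_target),
              residual_variance_critic_loss(v_pred + 10.0, v_target))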

    Online Sign Identification: Minimization of the Number of Errors in Thresholding Bandits

    Get PDF
    In the fixed budget thresholding bandit problem, an algorithm sequentially allocates a budgeted number of samples to different distributions. It then predicts whether the mean of each distribution is larger or lower than a given threshold. We introduce a large family of algorithms (containing most existing relevant ones), inspired by the Frank-Wolfe algorithm, and provide a thorough yet generic analysis of their performance. This allowed us to construct new explicit algorithms, for a broad class of problems, whose losses are within a small constant factor of the non-adaptive oracle ones. Quite interestingly, we observed that adaptive methods empirically greatly out-perform non-adaptive oracles, an uncommon behavior in standard online learning settings, such as regret minimization. We explain this surprising phenomenon on an insightful toy problem.
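
    To make the protocol concrete, the sketch below simulates a fixed-budget thresholding bandit with a simple adaptive index (an APT-style rule under a Gaussian-reward assumption); it is a hypothetical illustration of the setting, not the Frank-Wolfe-inspired family of algorithms introduced in the paper:

        # Illustrative sketch of the fixed-budget thresholding bandit protocol.
        # Arms are sampled adaptively, then each mean is classified as above or
        # below the threshold tau; the reported loss is the number of errors.
        import numpy as np

        def thresholding_bandit(means, tau, budget, sigma=1.0, seed=0):
            rng = np.random.default_rng(seed)
            K = len(means)
            counts = np.zeros(K, dtype=int)
            sums = np.zeros(K)
            for i in range(K):                      # pull each arm once
                sums[i] += rng.normal(means[i], sigma)
                counts[i] += 1
            for _ in range(budget - K):
                mu_hat = sums / counts
                # Spend the remaining budget on arms that are hardest to
                # classify: close to tau relative to how often they were pulled.
                index = np.sqrt(counts) * np.abs(mu_hat - tau)
                i = int(np.argmin(index))
                sums[i] += rng.normal(means[i], sigma)
                counts[i] += 1
            mu_hat = sums / counts
            predictions = mu_hat >= tau
            errors = int(np.sum(predictions != (np.array(means) >= tau)))
            return predictions, errors

        preds, errs = thresholding_bandit(means=[0.2, 0.45, 0.55, 0.8], tau=0.5, budget=2000)
        print(preds, "misclassified arms:", errs)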

    Learning Value Functions in Deep Policy Gradients using Residual Variance

    Get PDF
    Policy gradient algorithms have proven to be successful in diverse decision making and control tasks. However, these methods suffer from high sample complexity and instability issues. In this paper, we address these challenges by providing a different approach for training the critic in the actor-critic framework. Our work builds on recent studies indicating that traditional actor-critic algorithms do not succeed in fitting the true value function, calling for the need to identify a better objective for the critic. In our method, the critic uses a new state-value (resp. state-action-value) function approximation that learns the value of the states (resp. state-action pairs) relative to their mean value rather than the absolute value as in conventional actor-critic. We prove the theoretical consistency of the new gradient estimator and observe dramatic empirical improvement across a variety of continuous control tasks and algorithms. Furthermore, we validate our method in tasks with sparse rewards, where we provide experimental evidence and theoretical insights.

    Apprentissage par renforcement réaliste (Realistic Reinforcement Learning)

    No full text
    This thesis explores the challenge of making reinforcement learning (RL) more suitable for real-world problems without losing theoretical guarantees. This is an active research area: real-world applications are the final goal of this literature and the first motivation for its specific settings, while theoretical guarantees are, as the name suggests, the assurances that theory can provide about the performance and reliability of our strategies. Developing this field is crucial for improving interpretable RL algorithms. Our work is structured around four different RL settings and begins with an introduction to the field and a general review of the relevant literature, including bandits, Markov Decision Processes (MDPs), a number of reinforcement learning objectives, and relevant realistic RL challenges.
    The thesis proceeds by specifying various scenarios as well as different approaches to address these challenges. We first tackle an online sign identification setting for multi-armed bandits, where we investigate a generic method to design algorithms and a novel proof strategy providing state-of-the-art error bounds, and we present unprecedented observations when comparing adaptive algorithms to offline oracles. Our second contribution is a theoretical improvement of sequential linear regression yielding improved regret bounds and increased stability: we took inspiration from well-established results on sequential adversarial regression, adapted them to the stochastic setting, and then illustrated the improvements with an application to linear bandits. A significant contribution of this thesis is the study of the recent bilinear exponential family representation for continuous MDPs, where we make notable observations leading to tractability and improved theoretical guarantees. Finally, we tackle the setting of deep policy gradients, where we introduce a principled loss for more accurate value function learning; the need for this improvement is strongly motivated by recent work as well as by several experiments that we provide. The extensive experimental evaluation of our new approach reveals a significant performance boost, corroborating our insights and validating our claims.
    The results of this research demonstrate substantial progress in the RL literature, both practically and theoretically, offering valuable insights and solutions for the RL community. We believe that the proposed methods show the potential to close the gap between purely theoretical RL and applications-motivated RL, making this thesis a significant contribution to the field.

    Stochastic Online Linear Regression: the Forward Algorithm to Replace Ridge

    No full text
    We consider the problem of online linear regression in the stochastic setting. We derive high probability regret bounds for online ridge regression and the forward algorithm. This enables us to compare online regression algorithms more accurately and eliminate assumptions of bounded observations and predictions. Our study advocates for the use of the forward algorithm in lieu of ridge due to its enhanced bounds and robustness to the regularization parameter. Moreover, we explain how to integrate it in algorithms involving linear function approximation to remove a boundedness assumption without deteriorating theoretical bounds. We showcase this modification in linear bandit settings where it yields improved regret bounds. Last, we provide numerical experiments to illustrate our results and endorse our intuitions.

    Bilinear Exponential Family of MDPs: Frequentist Regret Bound with Tractable Exploration & Planning

    No full text
    We study the problem of episodic reinforcement learning in continuous state-action spaces with unknown rewards and transitions. Specifically, we consider the setting where the rewards and transitions are modeled using parametric bilinear exponential families. We propose an algorithm, BEF-RLSVI, that a) uses penalized maximum likelihood estimators to learn the unknown parameters, b) injects a calibrated Gaussian noise in the parameter of rewards to ensure exploration, and c) leverages linearity of the exponential family with respect to an underlying RKHS to perform tractable planning. We further provide a frequentist regret analysis of BEF-RLSVI that yields an upper bound of Õ((d^3 H^3 K)^{1/2}), where d is the dimension of the parameters, H is the episode length, and K is the number of episodes. Our analysis improves the existing bounds for the bilinear exponential family of MDPs by √H and removes the handcrafted clipping deployed in existing RLSVI-type algorithms. Our regret bound is order-optimal with respect to H and K.
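
    For readers unfamiliar with the model, the following is a schematic rendering of the bilinear exponential family and of the stated regret bound (notation is approximate; the exact reward model and normalization are those of the paper):

        % Transitions parameterized by theta^p through a bilinear form between
        % known features phi(s,a) of the state-action pair and psi(s') of the
        % next state, with Z_{s,a} the log-partition (normalization) term:
        P_{\theta^p}(s' \mid s, a) = \exp\Big( \sum_{i=1}^{d} \theta^p_i \, \psi(s')^{\top} A_i \, \varphi(s,a) - Z_{s,a}(\theta^p) \Big)
        % Frequentist regret of BEF-RLSVI over K episodes of length H,
        % with d the parameter dimension:
        \mathcal{R}(K) = \tilde{\mathcal{O}}\big( \sqrt{d^{3} H^{3} K} \big)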

    Online Sign Identification: Minimization of the Number of Errors in Thresholding Bandits

    No full text
    In the fixed budget thresholding bandit problem, an algorithm sequentially allocates a budgeted number of samples to different distributions. It then predicts whether the mean of each distribution is larger or lower than a given threshold. We introduce a large family of algorithms (containing most existing relevant ones), inspired by the Frank-Wolfe algorithm, and provide a thorough yet generic analysis of their performance. This allowed us to construct new explicit algorithms, for a broad class of problems, whose losses are within a small constant factor of the non-adaptive oracle ones. Quite interestingly, we observed that adaptive methods empirically greatly out-perform non-adaptive oracles, an uncommon behavior in standard online learning settings, such as regret minimization. We explain this surprising phenomenon on an insightful toy problem.